Global Computing platforms, large-scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures or disconnections are frequent events. This Volatility reduces the MTBF of the whole system to the range of hours or minutes. We present MPICH-V, an automatic Volatility-tolerant MPI environment based on uncoordinated checkpoint/rollback and distributed message logging. The MPICH-V architecture relies on Channel Memories, Checkpoint servers and theoretically proven protocols to execute existing or new SPMD and Master-Worker MPI applications on volatile nodes. To evaluate its capabilities, we run MPICH-V within a framework for which the number of nodes, Channel...
The running times of large-scale computational science and engineering parallel applications, execut...
Execution of MPI applications on Clusters and Grid deployments suffers from node and network failure...
A long-term trend in high-performance computing is the increasing number of no...
ISBN: 0-7695-152. Global Computing platforms, large scale clusters and future Te...
High performance computing platforms such as Clusters, Grid and Desktop Grids ...
This thesis focuses on fault-tolerance for MPI codes on computational clusters. When an application ...
Abstract. Fault Tolerant MPI (FT-MPI)[6] was designed as a solution to allow applications different ...
Abstract—As computational clusters increase in size, their mean-time-to-failure reduces drastically....
Abstract. As computational clusters increase in size, their mean-time-to-failure reduces. Typically ...
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)...
Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with th...
Abstract — Nowadays, clusters and grids are made of more and more computing nodes. The programming o...
Fault tolerance in parallel systems has traditionally been achieved through a combination of redunda...
As reported by many recent studies, the mean time between failures of future...